Business Problem

The objective here is to predict the RTFP\(^{NA}\) score (real total factor productivity at constant national prices) for the next 10 years for Canada, Mexico, and the USA. This factor measures the portion of a country's output growth not explained by its capital and labor inputs. Penn World Table recalculates the index every year, making it possible to compare all countries across the years.

Libs

These are the libraries used in this case:

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(TSstudio)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(skimr)
library(tidyquant)
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
## Loading required package: PerformanceAnalytics
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
## Loading required package: quantmod
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
## ══ Need to Learn tidyquant? ═══════════════════════════════════════════════════════════════════════════
## Business Science offers a 1-hour course - Learning Lab #9: Performance Analysis & Portfolio Optimization with tidyquant!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library(ggtext)

Data

# raw data
pwt_raw <- read_csv("data/TFP.csv")
## Parsed with column specification:
## cols(
##   isocode = col_character(),
##   year = col_double(),
##   rtfpna = col_double()
## )
# first look
glimpse(pwt_raw)
## Rows: 186
## Columns: 3
## $ isocode <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"…
## $ year    <dbl> 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1…
## $ rtfpna  <dbl> 0.6171479, 0.6295884, 0.6384513, 0.6518582, 0.6461794, 0.6687…
skim(pwt_raw)
Data summary
Name pwt_raw
Number of rows 186
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
isocode 0 1 3 3 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1980.50 17.94 1950.00 1965.00 1980.5 1996.00 2011.00 ▇▇▇▇▇
rtfpna 0 1 0.98 0.18 0.62 0.86 1.0 1.05 1.38 ▃▂▇▂▂

It’s possible to see that the series is distributed annually: each observation corresponds to a different year. The data have no missing values, so no imputation method will be necessary.
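The no-missing-values claim can also be double-checked directly with colSums(is.na(...)). A minimal sketch, using a small hypothetical data frame with the same columns in place of pwt_raw:

```r
# Hypothetical stand-in for pwt_raw (same columns, made-up values)
pwt_demo <- data.frame(
  isocode = c("USA", "CAN", "MEX"),
  year    = c(1950, 1950, 1950),
  rtfpna  = c(0.62, 0.71, 0.55)
)

# Missing values per column; all zeros means no imputation is needed
colSums(is.na(pwt_demo))
## isocode    year  rtfpna 
##       0       0       0
```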

pwt_raw %>% 
  group_by(isocode) %>% 
  count()
## # A tibble: 3 x 2
## # Groups:   isocode [3]
##   isocode     n
##   <chr>   <int>
## 1 CAN        62
## 2 MEX        62
## 3 USA        62

One limitation is that the data contain only 186 observations, 62 for each country. Since a 10-year forecast was requested, the horizon corresponds to roughly 16% of each country's observed history.
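The horizon share is simple arithmetic, worth a quick sanity check:

```r
obs_per_country <- 62  # observations per country, from the count above
horizon <- 10          # years to forecast

# Share of each country's history that the forecast horizon represents
round(horizon / obs_per_country * 100, 2)
## [1] 16.13
```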

Exploratory Data Analysis

# countries histograms
pwt_raw %>% 
  mutate(isocode2 = case_when(
    isocode == "CAN" ~ "Canada",
    isocode == "MEX" ~ "Mexico",
    TRUE ~ "United States of America"
  )) %>% 
  ggplot(aes(x = rtfpna, fill = isocode2)) +
  geom_histogram(binwidth = .05, color = "white") +
  scale_fill_manual(values = c("#1874CD", "#C28E42", "#00688B")) +
  facet_wrap(~isocode2) +
  theme_tq() +
  guides(fill = FALSE) +
  labs(
    title = "RTFP^NA Histogram for the Countries",
    x = "RTFP^NA",
    y = "Frequency",
    caption = "github.com/LucianoBatista"
  ) +
  theme(plot.title = element_markdown(),
        axis.title.x = element_markdown())

The data look close to a normal distribution, but with so few observations this is not clear-cut.

# time series visualization
pwt_raw %>% 
  mutate(isocode2 = case_when(
    isocode == "CAN" ~ "Canada",
    isocode == "MEX" ~ "Mexico",
    TRUE ~ "United States of America"
  )) %>%
  ggplot(aes(x = year,
             y = rtfpna,
             color = isocode2)) +
  geom_line() +
  scale_color_manual(values = c("#1874CD", "#C28E42", "#00688B")) +
  facet_wrap(~isocode2) +
  #expand_limits(y = 0) +
  guides(color = FALSE) +
  theme_tq() +
  labs(
    title = "Time Series for the Countries (1950-2011)",
    x = "",
    y = "RTFP^NA",
    caption = "github.com/LucianoBatista"

) +
  theme(
    axis.title.y = element_markdown()
  )

About the series profile: all three series show a long-run trend, and none displays an obvious seasonal pattern.

Creating a ts.obj object

To use the forecast package and others, it’s necessary to have a time series object, so each country’s data was converted with ts().

# ts_USA
pwt_usa_tbl <- pwt_raw %>% 
  filter(isocode == "USA") %>% 
  select(year, rtfpna)

# ts_MEX
pwt_mex_tbl <- pwt_raw %>% 
  filter(isocode == "MEX") %>% 
  select(year, rtfpna)

# ts_CAN
pwt_can_tbl <- pwt_raw %>% 
  filter(isocode == "CAN") %>% 
  select(year, rtfpna)

# first and the last time series year
start_point <- min(pwt_raw$year)
end_point <- max(pwt_raw$year)

# ts.obj
ts_USA <- ts(data = pwt_usa_tbl$rtfpna, # the series values
             start = start_point, # the time of the first observation
             end = end_point,  # the time of the last observation
             frequency = 1) # the series frequency

ts_MEX <- ts(data = pwt_mex_tbl$rtfpna, # the series values
            start = start_point, # the time of the first observation
            end = end_point,  # the time of the last observation
            frequency = 1) # the series frequency

ts_CAN <- ts(data = pwt_can_tbl$rtfpna, # the series values
            start = start_point, # the time of the first observation
            end = end_point,  # the time of the last observation
            frequency = 1) # the series frequency

# interactive view of the time series
ts_plot(ts_USA) # time series of interest
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

These will be the objects used for the forecasting.

Decomposition of Time Series Data

As shown before, the time series don’t seem to have visible seasonality. Still, it’s better to check that with the stats::decompose() function.

To plot the decomposition, the function requires a series with two or more periods per cycle, but all the series here are yearly. So the frequency parameter was changed, just for this visualization.

ts_USA_frq2 <- ts(data = pwt_usa_tbl$rtfpna, 
             start = start_point, 
             end = end_point,  
             frequency = 2) # changed

ts_MEX_frq2 <- ts(data = pwt_mex_tbl$rtfpna, 
            start = start_point, 
            end = end_point,  
            frequency = 2) # changed

ts_CAN_frq2 <- ts(data = pwt_can_tbl$rtfpna, 
            start = start_point,  
            end = end_point,  
            frequency = 2) # changed


plot(decompose(ts_USA_frq2))

plot(decompose(ts_MEX_frq2))

plot(decompose(ts_CAN_frq2))

It’s possible to see that all the seasonal components have very little variation on the y-axis. So it was assumed that none of the series has a meaningful seasonal component.
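A rough way to quantify how small a seasonal component is: compare its variance to the variance of the series itself. The sketch below uses a simulated trend-dominated series (synthetic data, not the PWT series) with the same frequency-2 trick as above:

```r
set.seed(42)

# Synthetic upward-drifting series with no real seasonality, at frequency 2
y_sim <- ts(1 + cumsum(rnorm(62, mean = 0.01, sd = 0.02)),
            start = 1950, frequency = 2)

dec <- decompose(y_sim)

# Share of total variance carried by the seasonal component;
# a tiny share supports dropping the seasonal terms
var(dec$seasonal) / var(y_sim)
```

For the RTFP series, applying the same ratio to the frequency-2 objects above gives a comparably negligible share.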

Modeling

In this step, ARIMA models will be used. These classic models are known for their robust behavior and are applicable to a wide variety of time series.

The hyperparameters of the ARIMA models will be tuned with respect to the AIC metric. The AIC (Akaike Information Criterion) balances goodness of fit against the number of parameters: the lower the value, the better.
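As a sketch of how AIC \(= 2k - 2\ln\hat{L}\) guides model comparison, the snippet below fits two candidate models with base stats::arima() on a simulated ARIMA(0,1,1) series (synthetic data; the grid search below applies the same idea to the real series):

```r
set.seed(123)

# Simulate an ARIMA(0,1,1) process; with d = 1, arima.sim returns n + 1 points
y_sim <- arima.sim(model = list(order = c(0, 1, 1), ma = 0.5), n = 61)

md_small <- arima(y_sim, order = c(0, 1, 1))  # matches the true process
md_big   <- arima(y_sim, order = c(1, 1, 1))  # adds an unneeded AR term

# Lower AIC is better; extra parameters are penalized
c(AIC_small = md_small$aic, AIC_big = md_big$aic)
```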

TS USA

# time series visualization
ts_plot(ts_USA)
# train and test
ts_USA_split <- ts_split(ts_USA, sample.out = 10)
train_usa <- ts_USA_split$train
test_usa <- ts_USA_split$test

# diagnostic with ACF and PACF
# ACF and PACF to check for autocorrelation
par(mfrow = c(1, 2))
acf(train_usa)
pacf(train_usa)

Each bar in the ACF plot represents the correlation between the series and one of its lags, in chronological order. The blue dotted lines indicate whether the correlation at each lag is statistically significant.

The PACF plot is similar; the difference is that it measures the correlation of each lag after removing the effect of the shorter lags.

Bars that stay between the dotted lines mean we cannot reject the hypothesis that there is no autocorrelation at that lag.

For the USA, the ACF plot shows clear autocorrelation, while the PACF suggests that, once the first lag is accounted for, little direct correlation remains.
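The blue dotted lines are not arbitrary: for a white-noise series of length \(n\), sample autocorrelations fall within roughly \(\pm 1.96/\sqrt{n}\) 95% of the time, which is where R draws them. For the USA training set (62 years minus the 10 held out):

```r
n <- 52                          # length of train_usa
bound <- qnorm(0.975) / sqrt(n)  # approximate 95% significance bound
round(bound, 3)
## [1] 0.272
```

Any ACF or PACF bar taller than about 0.27 in absolute value is therefore significant at the 5% level.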

# Seed for reproducibility
set.seed(123)

# hyperparameters
p <- q <- P <- Q <- 0:2

arima_grid <- expand.grid(p, q, P, Q)
names(arima_grid) <- c("p", "q", "P", "Q")
arima_grid$d <- 1
arima_grid$D <- 1

arima_grid$k <- rowSums(arima_grid)

# grid search table
arima_grid <- arima_grid %>% filter(k <= 7)

# iterate over the grid search table
arima_search <- lapply(1:nrow(arima_grid), function(i) {
  md <- arima(train_usa, order = c(arima_grid$p[i], 1, arima_grid$q[i]),
              seasonal = list(order = c(arima_grid$P[i], 1, arima_grid$Q[i])),
              method = "ML")
  data.frame(p = arima_grid$p[i], d = 1, q = arima_grid$q[i],
             P = arima_grid$P[i], D = 1, Q = arima_grid$Q[i],
             AIC = md$aic)
}) %>% bind_rows() %>% arrange(AIC)


head(arima_search)
##   p d q P D Q       AIC
## 1 0 1 1   0 1 0 -295.5225
## 2 0 1 0   0 1 1 -295.5225
## 3 0 1 1   1 1 0 -293.5345
## 4 1 1 1   0 1 0 -293.5345
## 5 1 1 0   0 1 1 -293.5345
## 6 0 1 0   1 1 1 -293.5345
best_model_ts_USA <- arima(train_usa, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 0)))

best_model_test_fc <- forecast(best_model_ts_USA, h = 10)

accuracy(best_model_test_fc, test_usa)
##                         ME       RMSE        MAE         MPE     MAPE      MASE
## Training set  0.0002214735 0.01160489 0.00938163 -0.01242196 1.237011 0.8166278
## Test set     -0.0073651074 0.01684445 0.01244563 -0.73431150 1.244776 1.0833347
##                      ACF1 Theil's U
## Training set -0.002494444        NA
## Test set      0.731128842  1.481552
test_forecast(ts_USA,
              forecast.obj = best_model_test_fc,
              test = test_usa)
# final step
final_md <- arima(ts_USA, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 0)))

ts_USA_forecast <- forecast(final_md, h = 10)

plot_forecast(ts_USA_forecast)

TS MEX

# time series visualization
ts_plot(ts_MEX)
# train and test
ts_MEX_split <- ts_split(ts_MEX, sample.out = 10)
train_mex <- ts_MEX_split$train
test_mex <- ts_MEX_split$test

# diagnostic with ACF and PACF func
# ACF and PACF to check for autocorrelation
par(mfrow = c(1, 2))
acf(train_mex)
pacf(train_mex)

The behavior of the Mexico time series is similar to the USA’s; the difference is that, over the years, the series has shown a negative correlation with its lags.

The PACF, however, remains without apparent partial autocorrelation.

# using the same grid search table as before
# iterate over the grid search table
arima_search <- lapply(1:nrow(arima_grid), function(i) {
  md <- arima(train_mex, order = c(arima_grid$p[i], 1, arima_grid$q[i]),
              seasonal = list(order = c(arima_grid$P[i], 1, arima_grid$Q[i])),
              method = "ML")
  data.frame(p = arima_grid$p[i], d = 1, q = arima_grid$q[i],
             P = arima_grid$P[i], D = 1, Q = arima_grid$Q[i],
             AIC = md$aic)
}) %>% bind_rows() %>% arrange(AIC)

head(arima_search)
##   p d q P D Q       AIC
## 1 0 1 1   0 1 0 -189.2815
## 2 0 1 0   0 1 1 -189.2815
## 3 0 1 2   0 1 0 -187.5147
## 4 0 1 0   0 1 2 -187.5147
## 5 1 1 1   0 1 0 -187.4391
## 6 0 1 1   1 1 0 -187.4391
best_model_ts_MEX <- arima(train_mex, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 0)))

best_model_test_fc <- forecast(best_model_ts_MEX, h = 10)

accuracy(best_model_test_fc, test_mex)
##                        ME       RMSE        MAE        MPE     MAPE      MASE
## Training set -0.005401946 0.03393699 0.02729705 -0.4490029 2.391014 0.9854180
## Test set     -0.020989575 0.03142131 0.02481903 -2.2047011 2.582690 0.8959623
##                    ACF1 Theil's U
## Training set 0.03406867        NA
## Test set     0.34116392  1.121388
test_forecast(ts_MEX,
              forecast.obj = best_model_test_fc,
              test = test_mex)
# final step
final_md <- arima(ts_MEX, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 0)))

ts_MEX_forecast <- forecast(final_md, h = 10)

plot_forecast(ts_MEX_forecast)

TS CAN

# time series visualization
ts_plot(ts_CAN) 
# train and test
ts_CAN_split <- ts_split(ts_CAN, sample.out = 10)
train_can <- ts_CAN_split$train
test_can <- ts_CAN_split$test

# diagnostic with ACF and PACF func
# ACF and PACF to check for autocorrelation
par(mfrow = c(1, 2))
acf(train_can)
pacf(train_can)

The ACF plot for the Canada time series drops below the blue dotted lines faster and seems to start with a negative autocorrelation.

In the PACF, all the bars stay between the blue dotted lines, indicating no significant partial autocorrelation.

# using the same grid search table as before
# iterate over the grid search table
arima_search <- lapply(1:nrow(arima_grid), function(i) {
  md <- arima(train_can, order = c(arima_grid$p[i], 1, arima_grid$q[i]),
              seasonal = list(order = c(arima_grid$P[i], 1, arima_grid$Q[i])),
              method = "ML")
  data.frame(p = arima_grid$p[i], d = 1, q = arima_grid$q[i],
             P = arima_grid$P[i], D = 1, Q = arima_grid$Q[i],
             AIC = md$aic)
}) %>% bind_rows() %>% arrange(AIC)

head(arima_search)
##   p d q P D Q       AIC
## 1 0 1 1   0 1 2 -249.1999
## 2 0 1 2   0 1 1 -249.1999
## 3 2 1 0   1 1 1 -248.7422
## 4 2 1 1   1 1 0 -248.7422
## 5 1 1 0   2 1 1 -248.7422
## 6 1 1 1   2 1 0 -248.7422
best_model_ts_CAN <- arima(train_can, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 2)))

best_model_test_fc <- forecast(best_model_ts_CAN, h = 10)

accuracy(best_model_test_fc, test_can)
##                        ME       RMSE        MAE        MPE     MAPE      MASE
## Training set -0.002632124 0.01771258 0.01319837 -0.2745477 1.344406 0.8581564
## Test set     -0.084149958 0.09637938 0.08414996 -8.7974533 8.797453 5.4714214
##                     ACF1 Theil's U
## Training set -0.01801727        NA
## Test set      0.69306916  6.311877
test_forecast(ts_CAN,
              forecast.obj = best_model_test_fc,
              test = test_can)
# final step
final_md <- arima(ts_CAN, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 2)))

ts_CAN_forecast <- forecast(final_md, h = 10)

plot_forecast(ts_CAN_forecast)

Conclusions

All the forecast values can be found in the objects ts_CAN_forecast, ts_MEX_forecast, and ts_USA_forecast. Unfortunately, the prediction intervals become very wide as the forecast year moves further from the observed data. Collecting more data could help narrow them.

Of the three predictions, the most precise was the forecast for the USA RTFP\(^{NA}\), which had the tightest prediction interval.
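For readers reproducing this without the forecast package, the same information (point forecasts plus widening intervals) can be pulled from a base-R fit with predict(). The sketch below uses a synthetic series in place of the real ones:

```r
set.seed(1)

# Synthetic stand-in for one of the RTFP series (62 annual points)
y_sim <- ts(1 + cumsum(rnorm(62, sd = 0.02)), start = 1950)

fit <- arima(y_sim, order = c(0, 1, 1))
fc  <- predict(fit, n.ahead = 10)

# Approximate 95% prediction intervals; fc$se grows with the horizon,
# which is why the forecast bands fan out for distant years
data.frame(year  = 2012:2021,
           point = fc$pred,
           lower = fc$pred - 1.96 * fc$se,
           upper = fc$pred + 1.96 * fc$se)
```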